Setup

What is the tidyverse?

The tidyverse consists of a few key packages for data import, manipulation, visualization and more.

library(tidyverse)

Objects and Classes

x = 1:3
y = 'a'
z = list(one = x, two = y)

x
y
z
str(z)
class(y)

Functions

Here is a small function that shows how easy it is to create your own.

my_sum_times_two <- function(x, y) {
  2 * sum(x, y)
}

my_sum_times_two(1, 2)

Data Structures

Vectors form the basis of R data structures. The two main types are atomic vectors and lists.

my_vector <- c(1, 2, 3)   # standard vector
my_list <- list(a = 1, b = 2)   # a named list
my_list

Data frames

Data frames are a special kind of list, and probably the most commonly used data structure for data science purposes.

my_data = data.frame(
  id = 1:3,
  name = c('Vernon', 'Ace', 'Cora')
)

my_data
class(my_data)

Importing Data

Importing data is usually the first step in any analysis.

demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')

Working with Databases

You must first connect to a database, but after that it can be used much like a data frame.

# requires DBI and RSQLite packages; just for demo
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# con

copy_to(con, demographics, 'demos')

Selecting Columns

A common step is to subset the data by column.

demographics %>% 
  select(gender, age, libuser)
demographics %>% 
  select(-libuser)
demographics %>% 
  select(starts_with('award'))

Filtering Rows

To filter data, think of a logical statement, something that can be TRUE or FALSE.

my_filtered_data = demographics %>% 
  filter(age < 40)

my_filtered_data = demographics %>% 
  filter(libuser == 1)

Generating new data

Another very common data processing task is to generate new variables.

demographics = demographics %>% 
  mutate(new_age = (age - mean(age, na.rm = TRUE)) / sd(age, na.rm = TRUE))

Renaming columns

demographics = demographics %>% 
  rename(age_std = new_age)
demographics %>% 
  rename_all(toupper) %>% 
  colnames()

Merging

Merging data can take a variety of forms and, depending on the data, can be quite complicated.

# same N rows as demos
left_join(demographics, ids)

# only ~ 50k rows
inner_join(demographics, ids) 

Exercises

Selecting and filtering

Use the : operator to select successive columns.

colnames(demographics)

demographics %>% 
  select(?)

Filter the data to award amounts less than 500000.

demographics %>% 
  filter(award_total_amount ?)

Generating new data

Generate a new award amount variable that is the log of the original. Give the new variable a useful name.

demographics %>% 
  mutate(? = log(?))

Python examples

Using Python for data science is not far removed from using R. Python’s main data processing library is pandas, which brings R-like data frames to Python.

Import

# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np

demographics = pd.read_csv('data/demos_anonymized.csv')
ids = pd.read_csv('data/ids_anonymized.csv')

demographics.head()  # show a few lines

Selecting Columns

# select by name
demographics[['age', 'award_total_amount']]
# select successive columns
demographics.loc[:,'libuser':'age']
# select by pattern
demographics.filter(regex='^award') 
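
The selection styles above can be tried without the data files. The following is a minimal sketch on a made-up data frame: the column names mirror those used above, but the values and the name toy are purely illustrative. It also adds drop() as a rough counterpart to select(-libuser) in R.

```python
import pandas as pd

# Hypothetical toy data; column names mirror those above, values are made up
toy = pd.DataFrame({
    'libuser': [1, 0, 1],
    'gender': ['F', 'M', 'F'],
    'age': [34, 51, 27],
    'award_total_amount': [1000.0, 250000.0, 750000.0],
})

by_name = toy[['age', 'award_total_amount']]   # list of names returns a DataFrame
by_range = toy.loc[:, 'libuser':'age']         # label slices include the end label
by_pattern = toy.filter(regex='^award')        # column names starting with 'award'
dropped = toy.drop(columns='libuser')          # everything except libuser

print(list(by_range.columns))
print(list(by_pattern.columns))
```

Note that, unlike positional slicing in Python, a label slice with .loc includes its end point, which makes it behave like the : operator in select().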

Filtering Rows

my_filtered_data = demographics[demographics.libuser == 1]
my_filtered_data.libuser.nunique()
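
As in R, filtering in pandas comes down to a logical statement evaluated for every row. A minimal sketch, assuming a made-up data frame (toy and its values are hypothetical), shows the boolean mask explicitly and two ways to combine conditions:

```python
import pandas as pd

# Hypothetical toy data, not the workshop files
toy = pd.DataFrame({'libuser': [1, 0, 1, 0], 'age': [34, 51, 27, 45]})

mask = toy['age'] < 40          # boolean Series: one True/False per row
under_40 = toy[mask]            # keep only rows where the mask is True

# combine conditions with & (and) or | (or); each condition needs its own parentheses
young_users = toy[(toy['age'] < 40) & (toy['libuser'] == 1)]

# query() expresses the same filter as a string
same_result = toy.query('age < 40 and libuser == 1')

print(mask.tolist())
print(len(young_users))
```

Whether to index with a mask or use query() is largely a matter of taste; the mask form is closer to what filter() does in dplyr.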

Generating new data

# standardize age; pandas' std() uses the sample formula (ddof=1), like R's sd(), and skips missing values
demographics['new_age'] = (demographics['age'] - demographics['age'].mean()) / demographics['age'].std()

demographics.new_age.describe()  # mean = 0  sd = 1
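
One detail worth checking when standardizing in Python: NumPy's np.std defaults to the population formula (ddof=0), while pandas' std() and describe() use the sample formula (ddof=1), as R's sd() does. The sketch below uses a made-up age vector (all values are hypothetical) to show that the reported standard deviation is 1 only when the two formulas match.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
age = pd.Series([20.0, 30.0, 40.0, 50.0, np.nan])

# pandas skips NaN by default, similar to na.rm = TRUE in R
z_sample = (age - age.mean()) / age.std()             # ddof=1, as in R's sd()
z_population = (age - age.mean()) / age.std(ddof=0)   # ddof=0, NumPy's default

# Series.std() and describe() both report the ddof=1 version
print(z_sample.std())
print(z_population.std())
```

The difference shrinks as the number of rows grows, so on a large data set either version is usually fine; it mainly matters when comparing results against R.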

Renaming columns

demographics = demographics.rename(columns={'new_age': 'age_std'})

Joins

demos_joined = pd.merge(demographics, ids, how='left', on='EMPLID')
demos_joined
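# DataFrame.join matches on the index by default; lsuffix is appended to overlapping left-hand column names
demos_joined = demographics.join(ids, how='left', lsuffix='EMPLID')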
demos_joined.shape
demos_joined.columns
demos_joined = demographics.join(ids, how='inner', lsuffix='EMPLID')

demos_joined.columns
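
To see how left and inner joins differ in row counts, here is a minimal sketch on two made-up tables. Only the EMPLID key mirrors the real files; demos_toy, ids_toy, the dept column, and all values are hypothetical. The indicator argument of pd.merge adds a _merge column recording where each row found a match.

```python
import pandas as pd

# Hypothetical stand-ins for the two files; only the EMPLID key mirrors the real data
demos_toy = pd.DataFrame({'EMPLID': [1, 2, 3, 4], 'age': [34, 51, 27, 45]})
ids_toy = pd.DataFrame({'EMPLID': [2, 3, 5], 'dept': ['bio', 'chem', 'math']})

# left join keeps every row of demos_toy; unmatched rows get NaN for dept
left = pd.merge(demos_toy, ids_toy, how='left', on='EMPLID', indicator=True)

# inner join keeps only the EMPLIDs present in both tables
inner = pd.merge(demos_toy, ids_toy, how='inner', on='EMPLID')

print(left.shape, inner.shape)
print(left['_merge'].value_counts())

# join() matches on the index unless told otherwise, so set the key as the index first
joined = demos_toy.set_index('EMPLID').join(ids_toy.set_index('EMPLID'), how='inner')
print(joined)
```

In practice pd.merge with an explicit on= is often the easier choice when the key is an ordinary column, while join() is convenient once the key is already the index.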